This is not original material but a compilation of the following sources with some additional examples and clarifications introduced by the professor:
We will work with the fpp3 package, developed by Prof. Rob Hyndman and his team, who are among the main contributors to the time series community, so remember their name.
This package extends the tidyverse framework to time series. To do so, it loads several packages under the hood (see the output of the following cell):
library(fpp3)
## Warning: package 'fpp3' was built under R version 4.2.2
## ── Attaching packages ──────────────────────────────────────────── fpp3 0.4.0 ──
## ✔ tibble 3.1.8 ✔ tsibble 1.1.3
## ✔ dplyr 1.0.10 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.2.1 ✔ feasts 0.3.0
## ✔ lubridate 1.8.0 ✔ fable 0.3.2
## ✔ ggplot2 3.3.6
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::group_rows() masks kableExtra::group_rows()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
Additionally, for this particular session we will use the packages GGally, fma and patchwork. Install them if you do not have them already:
install.packages("GGally") # to generate a scatterplot matrix
install.packages("fma") # to load the US Treasury Bills dataset
install.packages("patchwork") # Used to manage the relative location of ggplots
library(patchwork) # Used to manage the relative location of ggplots
library(GGally)
library(fma) # to load the US Treasury Bills dataset
| Pattern | Description |
|---|---|
| Trend | Long-term increase OR decrease in the data |
| Seasonal | Periodic pattern due to the calendar (quarter, month, day of the week…) |
| Cyclic | Pattern where data exhibits rises and falls that are NOT FIXED IN PERIOD (typically duration of at least 2 years) |
| Seasonal | Cyclic |
|---|---|
| Seasonal pattern of constant length | Cyclic pattern of variable length |
| Shorter than cyclic pattern | Longer than seasonal pattern |
| Magnitude less variable than cyclic pattern | Magnitude more variable than seasonal pattern |
Let us look at some examples:
aus_production %>%
# Filter data to show appropriate timeframe
filter((year(Quarter) >= 1980 & year(Quarter)<= 2000)) %>%
autoplot(Electricity) +
# Scale the x axis adequately
scale_x_yearquarter(date_breaks = "1 year",
minor_breaks = "1 year") +
# Flip x-labels by 90 degrees
theme(axis.text.x = element_text(angle = 90))
Strong trend with a change in the slope of the trend around 1990.
autoplot() as a wrapper around ggplot2 objects
IMPORTANTLY, the function autoplot() of the library fpp3 is nothing but a wrapper around ggplot2 objects. Depending on the object fed to autoplot(), it will behave in one way or another, combining different ggplot objects.
In the example above, autoplot() is nothing but a wrapper around geom_line(). Using autoplot() is simply much more convenient, and it is a good idea to have such a function in a time-series-dedicated library such as fpp3.
aus_production %>%
filter((year(Quarter) >= 1980 & year(Quarter)<= 2000)) %>%
# The two lines below are equivalent to autoplot(Electricity)
ggplot(aes(x = Quarter, y = Electricity)) +
geom_line() +
# Scale the x axis adequately
scale_x_yearquarter(date_breaks = "1 year",
minor_breaks = "1 year") +
# Flip x-labels by 90 degrees
theme(axis.text.x = element_text(angle = 90))
We will use autoplot() extensively during the course and it will behave in a different manner depending on the object we feed to the function.
You only need to be aware that since it is a wrapper around ggplot objects, you may combine it with other ggplot2 commands, which is something we are doing above to, for example, rotate the x-labels and scale the x-axis grid.
Scale the x-grid of the following graph so that the end of each year is clearly visible
aus_production %>% autoplot(Bricks)
## Warning: Removed 20 row(s) containing missing values (geom_path).
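A possible solution (a sketch, assuming the same scale_x_yearquarter() approach used for the Electricity example above):

```r
library(fpp3)

aus_production %>%
  autoplot(Bricks) +
  # One grid line per year makes the end of each year clearly visible
  scale_x_yearquarter(date_breaks = "1 year",
                      minor_breaks = "1 year") +
  # Flip x-labels so the yearly labels do not overlap
  theme(axis.text.x = element_text(angle = 90))
```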
US treasury bill contracts on the Chicago market for 100 consecutive trading days in 1981.
See Treasury Bills.
ustreas
## Time Series:
## Start = 1
## End = 100
## Frequency = 1
## [1] 91.57 91.56 91.40 91.39 91.20 90.90 90.99 91.17 90.98 91.11 91.20 90.92
## [13] 90.73 90.79 90.86 90.83 90.80 90.21 90.10 90.10 89.66 89.81 89.40 89.34
## [25] 89.16 88.71 89.28 89.36 89.64 89.53 89.32 89.14 89.41 89.53 89.16 88.98
## [37] 88.50 88.63 88.62 88.76 89.07 88.47 88.41 88.57 88.23 87.93 87.88 88.18
## [49] 88.04 88.18 88.78 89.29 89.14 89.14 89.42 89.26 89.37 89.51 89.66 89.39
## [61] 89.02 89.05 88.97 88.57 88.44 88.52 87.92 87.71 87.52 87.37 87.27 87.07
## [73] 86.83 86.88 87.48 87.30 87.87 87.85 87.83 87.40 86.89 87.38 87.48 87.21
## [85] 87.02 87.10 87.02 87.07 86.74 86.35 86.32 86.77 86.77 86.49 86.02 85.52
## [97] 85.02 84.42 85.29 85.32
If you look at this, it is a ts object (time series). This is another possible format for time series in R, but remember that we are going to use tsibbles to allow for seamless functionality with the fpp3 library. You can convert the ts to a tsibble using as_tsibble():
ustreas_tsibble <- as_tsibble(ustreas)
ustreas_tsibble
## # A tsibble: 100 x 2 [1]
## index value
## <dbl> <dbl>
## 1 1 91.6
## 2 2 91.6
## 3 3 91.4
## 4 4 91.4
## 5 5 91.2
## 6 6 90.9
## 7 7 91.0
## 8 8 91.2
## 9 9 91.0
## 10 10 91.1
## # … with 90 more rows
autoplot(ustreas_tsibble)
## Plot variable not specified, automatically selected `.vars = value`
Looks like a downward trend, but it is part of a much longer cycle (not shown). You do not want to make forecasts with this data too long into the future.
scale_x_continuous: adjusting the grid when the index is numeric
Examining our tsibble reveals that the index is numeric (trading day):
ustreas_tsibble %>% head(5)
## # A tsibble: 5 x 2 [1]
## index value
## <dbl> <dbl>
## 1 1 91.6
## 2 2 91.6
## 3 3 91.4
## 4 4 91.4
## 5 5 91.2
For this reason, we need to use the function scale_x_continuous(). In addition, we need to generate specific sequences marking the tick positions. For example, if we want major ticks every 10 days and minor ticks every 5 days:
major_ticks_seq = seq(0, max(ustreas_tsibble$index), 10)
major_ticks_seq
## [1] 0 10 20 30 40 50 60 70 80 90 100
minor_ticks_seq = seq(0, max(ustreas_tsibble$index), 5)
minor_ticks_seq
## [1] 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90
## [20] 95 100
ustreas_tsibble %>%
autoplot() +
scale_x_continuous(breaks = major_ticks_seq,
minor_breaks = minor_ticks_seq)
## Plot variable not specified, automatically selected `.vars = value`
ANNUAL DATA is not susceptible to calendar variations. It is sampled at the same point every year. Calendar variations might be present, but they are not captured by yearly data. Therefore, there is no seasonality in yearly data; if anything, there could be cyclic behavior.
Hudson Bay Company trading records for Snowshoe Hare and Canadian Lynx furs from 1845 to 1935. This data contains trade records for all areas of the company. We are going to analyse the number of Canadian Lynx pelts traded.
pelt %>% head(5)
## # A tsibble: 5 x 3 [1Y]
## Year Hare Lynx
## <dbl> <dbl> <dbl>
## 1 1845 19580 30090
## 2 1846 19600 45150
## 3 1847 19610 49150
## 4 1848 11990 39520
## 5 1849 28040 21230
lynx <- pelt %>% select(Year, Lynx)
lynx %>% head(5)
## # A tsibble: 5 x 2 [1Y]
## Year Lynx
## <dbl> <dbl>
## 1 1845 30090
## 2 1846 45150
## 3 1847 49150
## 4 1848 39520
## 5 1849 21230
Plot the time series ensuring we have major ticks every 10 years and minor ticks every 5 years:
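One possible solution (a sketch; since the Year index is numeric, we build the tick sequences explicitly, just as we did for the treasury-bill data):

```r
library(fpp3)

# Major ticks every 10 years, minor ticks every 5 years
major_ticks_seq <- seq(1845, 1935, 10)
minor_ticks_seq <- seq(1845, 1935, 5)

pelt %>%
  select(Year, Lynx) %>%
  autoplot(Lynx) +
  scale_x_continuous(breaks = major_ticks_seq,
                     minor_breaks = minor_ticks_seq)
```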
Monthly sales of new one-family houses sold in the USA since 1973.
hsales_tsibble <- hsales %>% as_tsibble()
Create a timeplot that has major ticks every 5 years and minor ticks every year
hsales_tsibble %>%
autoplot()
## Plot variable not specified, automatically selected `.vars = value`
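A possible solution (a sketch; the index is a yearmonth, so scale_x_yearmonth() with break strings, as used later for the a10 data, should work):

```r
library(fpp3)
library(fma)  # provides the hsales dataset

hsales %>%
  as_tsibble() %>%
  autoplot(value) +
  # Major ticks every 5 years, minor ticks every year
  scale_x_yearmonth(breaks = "5 years",
                    minor_breaks = "1 year") +
  # Flip x-labels so the yearly labels do not overlap
  theme(axis.text.x = element_text(angle = 90))
```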
Half-hourly electricity demand in England and Wales from Monday 5 June 2000 to Sunday 27 August 2000. Discussed in Taylor (2003), and kindly provided by James W Taylor. Units: Megawatts
autoplot(taylor)
IMPORTANT: the electricity demand does not always follow such an extremely regular pattern. There might be weekly seasonality, but it is not necessarily so regular.
Half-hourly electricity demand in for Victoria (Australia).
vic_elec %>% head(5)
## # A tsibble: 5 x 5 [30m] <Australia/Melbourne>
## Time Demand Temperature Date Holiday
## <dttm> <dbl> <dbl> <date> <lgl>
## 1 2012-01-01 00:00:00 4383. 21.4 2012-01-01 TRUE
## 2 2012-01-01 00:30:00 4263. 21.0 2012-01-01 TRUE
## 3 2012-01-01 01:00:00 4049. 20.7 2012-01-01 TRUE
## 4 2012-01-01 01:30:00 3878. 20.6 2012-01-01 TRUE
## 5 2012-01-01 02:00:00 4036. 20.4 2012-01-01 TRUE
elec_jan <- vic_elec %>%
mutate(month = yearmonth(Time)) %>%
filter(month == yearmonth("2012 Jan"))
scale_x_datetime: indexes that are date-time objects
Create a time-plot of this data using scale_x_datetime to establish major breaks of 1 week and minor breaks of 1 day.
elec_jan %>%
autoplot() +
scale_x_datetime(breaks = "1 week",
minor_breaks = "1 day")
## Plot variable not specified, automatically selected `.vars = Demand`
PBS contains monthly Medicare Australia prescription data. We are going to select the records whose ATC2 code is A10. For further info run ?PBS in the console.
PBS %>%
filter(ATC2 == "A10") %>%
select(Month, Concession, Type, Cost) %>%
summarise(TotalC = sum(Cost)) %>%
# Comment on this assignment operator
mutate(Cost = TotalC / 1e6) -> a10
a10 %>% head(5)
## # A tsibble: 5 x 3 [1M]
## Month TotalC Cost
## <mth> <dbl> <dbl>
## 1 1991 Jul 3526591 3.53
## 2 1991 Aug 3180891 3.18
## 3 1991 Sep 3252221 3.25
## 4 1991 Oct 3611003 3.61
## 5 1991 Nov 3565869 3.57
autoplot(a10, Cost) +
labs(y = "$ (millions)",
title = "Australian antidiabetic drug sales") +
scale_x_yearmonth(breaks = "1 year") +
theme(axis.text.x = element_text(angle = 90))
A seasonal plot is similar to a time plot, except that the data are plotted against the individual “seasons” in which the data were observed. Essentially, the time axis is limited to a single season (in this case, a year).
The period of a time series is always of higher order than the sampling frequency of the time series. In this case we have quarterly data, so the natural period to consider is a year, which covers all the calendar variations and is of higher order than the data considered. We will deal with seasonality and periods in more detail in the coming lessons.
a10 %>%
gg_season(Cost, labels = "both") + # Labels -> "both", "right", "left"
labs(y = "$ (millions)",
title = "Seasonal plot: Antidiabetic drug sales")
vic_elec contains half-hourly electricity demand for the state of Victoria (Australia) for three years. Because data is sampled every half-hour, we can find multiple seasonal patterns in the data:
We can use the argument period to select the period we are interested in:
head(vic_elec)
## # A tsibble: 6 x 5 [30m] <Australia/Melbourne>
## Time Demand Temperature Date Holiday
## <dttm> <dbl> <dbl> <date> <lgl>
## 1 2012-01-01 00:00:00 4383. 21.4 2012-01-01 TRUE
## 2 2012-01-01 00:30:00 4263. 21.0 2012-01-01 TRUE
## 3 2012-01-01 01:00:00 4049. 20.7 2012-01-01 TRUE
## 4 2012-01-01 01:30:00 3878. 20.6 2012-01-01 TRUE
## 5 2012-01-01 02:00:00 4036. 20.4 2012-01-01 TRUE
## 6 2012-01-01 02:30:00 3866. 20.2 2012-01-01 TRUE
tail(vic_elec)
## # A tsibble: 6 x 5 [30m] <Australia/Melbourne>
## Time Demand Temperature Date Holiday
## <dttm> <dbl> <dbl> <date> <lgl>
## 1 2014-12-31 21:00:00 3928. 20.3 2014-12-31 FALSE
## 2 2014-12-31 21:30:00 3873. 19 2014-12-31 FALSE
## 3 2014-12-31 22:00:00 3792. 18.5 2014-12-31 FALSE
## 4 2014-12-31 22:30:00 3725. 17.7 2014-12-31 FALSE
## 5 2014-12-31 23:00:00 3762. 17.3 2014-12-31 FALSE
## 6 2014-12-31 23:30:00 3809. 17.1 2014-12-31 FALSE
vic_elec %>% gg_season(Demand, period = "day") +
theme(legend.position = "none") +
labs(y="MWh", title="Electricity demand: Victoria")
vic_elec %>% gg_season(Demand, period = "week") +
theme(legend.position = "none") +
labs(y="MWh", title="Electricity demand: Victoria")
vic_elec %>% gg_season(Demand, period = "year") +
labs(y="MWh", title="Electricity demand: Victoria")
The aus_arrivals (loaded with the fpp3 library) data set comprises quarterly international arrivals (in thousands) to Australia from Japan, New Zealand, UK and the US.
Create a:

- autoplot() plot (one curve per country of origin)
- gg_season() plot (due to the structure of the data, it will produce one graph per country of origin)
- gg_subseries() plot (see next section on gg_subseries())

Each should show the evolution of the quarterly arrivals per country over the time period considered.

aus_arrivals
## # A tsibble: 508 x 3 [1Q]
## # Key: Origin [4]
## Quarter Origin Arrivals
## <qtr> <chr> <int>
## 1 1981 Q1 Japan 14763
## 2 1981 Q2 Japan 9321
## 3 1981 Q3 Japan 10166
## 4 1981 Q4 Japan 19509
## 5 1982 Q1 Japan 17117
## 6 1982 Q2 Japan 10617
## 7 1982 Q3 Japan 11737
## 8 1982 Q4 Japan 20961
## 9 1983 Q1 Japan 20671
## 10 1983 Q2 Japan 12235
## # … with 498 more rows
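Possible solutions (sketches; each relies on Origin being the key of the tsibble):

```r
library(fpp3)

# Time plot: autoplot() draws one curve per key (country of origin)
aus_arrivals %>% autoplot(Arrivals)

# Seasonal plot: one panel per country of origin
aus_arrivals %>% gg_season(Arrivals, labels = "both")

# Subseries plot: one mini time plot per quarter, one row per origin
aus_arrivals %>% gg_subseries(Arrivals)
```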
gg_subseries()
An alternative plot to emphasize seasonal patterns. The data for each season are collected together in separate mini time plots, one plot for each of the seasonal divisions. All this is best understood through examples:
a10
## # A tsibble: 204 x 3 [1M]
## Month TotalC Cost
## <mth> <dbl> <dbl>
## 1 1991 Jul 3526591 3.53
## 2 1991 Aug 3180891 3.18
## 3 1991 Sep 3252221 3.25
## 4 1991 Oct 3611003 3.61
## 5 1991 Nov 3565869 3.57
## 6 1991 Dec 4306371 4.31
## 7 1992 Jan 5088335 5.09
## 8 1992 Feb 2814520 2.81
## 9 1992 Mar 2985811 2.99
## 10 1992 Apr 3204780 3.20
## # … with 194 more rows
a10 %>%
gg_subseries(Cost) +
labs(
y = "$ (millions)",
title = "Australian antidiabetic drug sales",
)
Why does it produce this output?
Because of the date format: Year-Month.
As you can see, specifying “year” as the period is simply redundant in this case.
a10 %>%
gg_subseries(Cost, period = "year") +
labs(
y = "$ (millions)",
title = "Australian antidiabetic drug sales",
)
Question: what will the output of this code be?
a10 %>%
gg_subseries(Cost, period = "month") +
labs(
y = "$ (millions)",
title = "Australian antidiabetic drug sales",
)
Question what is the output of this code?
a10 %>%
gg_subseries(Cost, period = "day") +
labs(
y = "$ (millions)",
title = "Australian antidiabetic drug sales",
)
NOTE: this example is not extremely important from a mathematical point of view; rather, it helps to understand the logic behind sub-seasonal plots.
head(vic_elec)
## # A tsibble: 6 x 5 [30m] <Australia/Melbourne>
## Time Demand Temperature Date Holiday
## <dttm> <dbl> <dbl> <date> <lgl>
## 1 2012-01-01 00:00:00 4383. 21.4 2012-01-01 TRUE
## 2 2012-01-01 00:30:00 4263. 21.0 2012-01-01 TRUE
## 3 2012-01-01 01:00:00 4049. 20.7 2012-01-01 TRUE
## 4 2012-01-01 01:30:00 3878. 20.6 2012-01-01 TRUE
## 5 2012-01-01 02:00:00 4036. 20.4 2012-01-01 TRUE
## 6 2012-01-01 02:30:00 3866. 20.2 2012-01-01 TRUE
tail(vic_elec)
## # A tsibble: 6 x 5 [30m] <Australia/Melbourne>
## Time Demand Temperature Date Holiday
## <dttm> <dbl> <dbl> <date> <lgl>
## 1 2014-12-31 21:00:00 3928. 20.3 2014-12-31 FALSE
## 2 2014-12-31 21:30:00 3873. 19 2014-12-31 FALSE
## 3 2014-12-31 22:00:00 3792. 18.5 2014-12-31 FALSE
## 4 2014-12-31 22:30:00 3725. 17.7 2014-12-31 FALSE
## 5 2014-12-31 23:00:00 3762. 17.3 2014-12-31 FALSE
## 6 2014-12-31 23:30:00 3809. 17.1 2014-12-31 FALSE
Question: what would be the output of this code? DO NOT RUN THIS CELL. It might crash your computer.
vic_elec %>%
gg_subseries(Demand) +
labs(
y = "Demand [MW]",
title = "Half-hourly energy demand in the state of Victoria",
)
It would take the greatest time unit in the index (year) and split it by the smallest time unit in the index (half an hour). Since there are 365 * 24 * 2 = 17520 half-hours in a year, it would create 17520 mini-plots. This would be useless, unreadable, and would probably crash your computer.
Takeaway: we need to be careful when we want to use the gg_subseries() function with sub-daily data.
A first possibility would be to specify period = "day". This is a natural option with sub-daily data, because a day is the next larger time unit when dealing with half-hourly data.
Question: what is the output of this code?
vic_elec %>%
gg_subseries(Demand, period = "day") +
labs(
y = "[MW]",
title = "Total electricity demand for Victoria, Australia"
)
# Necessary to explore the output given the density of the data
ggsave("subseasonal_demand_hourly_PeriodDay.svg", width = 30, height = 5)
Question: what is the output of this code?
vic_elec_m <- vic_elec %>%
index_by(month = yearmonth(Time)) %>%
summarise(
avg_demand = mean(Demand)
)
vic_elec_m %>%
gg_subseries(avg_demand, period = "year") +
labs(
y = "[MW]",
title = "Total electricity demand for Victoria, Australia"
)
Question: what is the output of this code?
We could also aggregate by quarter
vic_elec_q <- vic_elec %>%
index_by(quarter = yearquarter(Time)) %>%
summarise(
avg_demand = mean(Demand)
)
vic_elec_q %>%
gg_subseries(avg_demand, period = "year") +
labs(
y = "[MW]",
title = "Total electricity demand for Victoria, Australia"
)
Using index_by() + summarise() combined with mean(), create a gg_subseries() plot that has one graph for each day of the month and plots the values on each day for each month, only for the year 2012. See expected output in the separate file subseasonal_demand_daily_PeriodMonth.svg.
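One possible solution (a sketch: aggregate to daily averages with index_by() + summarise(), keep only 2012, and use period = "month" so each day of the month gets its own mini-plot):

```r
library(fpp3)

vic_elec %>%
  # Aggregate the half-hourly data to daily averages
  index_by(day = as_date(Time)) %>%
  summarise(avg_demand = mean(Demand)) %>%
  # Keep only the year 2012
  filter(year(day) == 2012) %>%
  # One graph per day of the month
  gg_subseries(avg_demand, period = "month") +
  labs(y = "[MW]",
       title = "Average daily electricity demand for Victoria, Australia (2012)")
```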
Question: what would be the output of this code?
vic_elec_d <- vic_elec %>%
index_by(day = as_date(Time)) %>%
summarise(
avg_demand = mean(Demand)
)
# Filter values for 2012
filter(vic_elec_d, year(day) == 2012) %>%
# Create gg subseries plot
gg_subseries(avg_demand, period = "year") +
labs(
y = "[MW]",
title = "Total electricity demand for Victoria, Australia"
)
ggsave("subseasonal_demand_daily_PeriodYear.svg", width = 40, height = 5)
vic_elec
## # A tsibble: 52,608 x 5 [30m] <Australia/Melbourne>
## Time Demand Temperature Date Holiday
## <dttm> <dbl> <dbl> <date> <lgl>
## 1 2012-01-01 00:00:00 4383. 21.4 2012-01-01 TRUE
## 2 2012-01-01 00:30:00 4263. 21.0 2012-01-01 TRUE
## 3 2012-01-01 01:00:00 4049. 20.7 2012-01-01 TRUE
## 4 2012-01-01 01:30:00 3878. 20.6 2012-01-01 TRUE
## 5 2012-01-01 02:00:00 4036. 20.4 2012-01-01 TRUE
## 6 2012-01-01 02:30:00 3866. 20.2 2012-01-01 TRUE
## 7 2012-01-01 03:00:00 3694. 20.1 2012-01-01 TRUE
## 8 2012-01-01 03:30:00 3562. 19.6 2012-01-01 TRUE
## 9 2012-01-01 04:00:00 3433. 19.1 2012-01-01 TRUE
## 10 2012-01-01 04:30:00 3359. 19.0 2012-01-01 TRUE
## # … with 52,598 more rows
p1 <- vic_elec %>%
filter(year(Time) == 2014) %>%
autoplot(Demand) +
labs(y = "GW",
title = "Half-hourly electricity demand: Victoria")
p2 <- vic_elec %>%
filter(year(Time) == 2014) %>%
autoplot(Temperature) +
labs(
y = "Degrees Celsius",
title = "Half-hourly temperatures: Melbourne, Australia"
)
p1 / p2
vic_elec %>%
filter(year(Time) == 2014) %>%
ggplot(aes(x = Temperature, y = Demand)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
geom_smooth(method = "loess", color = "red", se = FALSE) +
labs(x = "Temperature (degrees Celsius)",
y = "Electricity demand (GW)")
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
The correlation coefficient only measures the strength of linear relationship. For example, the correlation for the electricity demand and temperature data shown in the previous figure is only 0.26, but their non-linear relationship is actually much stronger:
# Compute correlation coefficient
round(cor(vic_elec$Temperature, vic_elec$Demand), 2)
## [1] 0.26
Remember that a correlation coefficient of 0 does not imply that there is no relationship. It just implies that there is no linear relationship.
Note: beware that there are very bad sources on statistics and data science on the internet:
Let us see a specific toy example that you should keep in your head forever:
df <- tibble(
x = seq(-4, 4, 0.05),
y = x^2
)
df %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
cor(df$x, df$y)
## [1] 1.627669e-16
QUESTION: Why is it not exactly 0?
Other examples of variables with a correlation coefficient of 0:
Figure 1: Examples of corr coefficient = 0 (Wikipedia)
Remember as well that two variables may have a very similar correlation coefficient yet look very different. The example below (ref [1]) shows scatterplots of different variables, all of which have a correlation coefficient of 0.82:
Figure 2: Examples of variables having the same correlation coefficient. From [1]
The image above is known as Anscombe’s quartet. There are infinitely many datasets with the same correlation coefficient. As an example, see the funny .gif image below:
Datasaurus - from ref. 3
All these examples are meant to make you aware of how important it is to depict a scatterplot of the variables and not rely only on the correlation coefficient, which is a summary only of their linear relationship.
visitors <- tourism %>%
group_by(State) %>%
summarise(Trips = sum(Trips))
visitors
## # A tsibble: 640 x 3 [1Q]
## # Key: State [8]
## State Quarter Trips
## <chr> <qtr> <dbl>
## 1 ACT 1998 Q1 551.
## 2 ACT 1998 Q2 416.
## 3 ACT 1998 Q3 436.
## 4 ACT 1998 Q4 450.
## 5 ACT 1999 Q1 379.
## 6 ACT 1999 Q2 558.
## 7 ACT 1999 Q3 449.
## 8 ACT 1999 Q4 595.
## 9 ACT 2000 Q1 600.
## 10 ACT 2000 Q2 557.
## # … with 630 more rows
# Check distinct values of column "State"
distinct(visitors, State)
## # A tibble: 8 × 1
## State
## <chr>
## 1 ACT
## 2 New South Wales
## 3 Northern Territory
## 4 Queensland
## 5 South Australia
## 6 Tasmania
## 7 Victoria
## 8 Western Australia
visitors %>%
pivot_wider(values_from=Trips, names_from=State)
## # A tsibble: 80 x 9 [1Q]
## Quarter ACT New South Wal…¹ North…² Queen…³ South…⁴ Tasma…⁵ Victo…⁶ Weste…⁷
## <qtr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1998 Q1 551. 8040. 181. 4041. 1735. 982. 6010. 1641.
## 2 1998 Q2 416. 7166. 314. 3968. 1395. 693. 4795. 1576.
## 3 1998 Q3 436. 6748. 528. 4594. 1213. 402. 4317. 1588.
## 4 1998 Q4 450. 7282. 248. 4203. 1453. 681. 4675. 1840.
## 5 1999 Q1 379. 7585. 185. 4332. 1541. 925. 5304. 1836.
## 6 1999 Q2 558. 7054. 366. 4824. 1636. 621. 4562. 1837.
## 7 1999 Q3 449. 6724. 501. 5018. 1283. 430. 3784. 1725.
## 8 1999 Q4 595. 7136. 248. 4350. 1387. 592. 4201. 1519.
## 9 2000 Q1 600. 7296. 206. 4413. 1833. 789. 5567. 1636.
## 10 2000 Q2 557. 6445. 360. 4344. 1415. 509. 4502. 1808.
## # … with 70 more rows, and abbreviated variable names ¹`New South Wales`,
## # ²`Northern Territory`, ³Queensland, ⁴`South Australia`, ⁵Tasmania,
## # ⁶Victoria, ⁷`Western Australia`
visitors %>%
pivot_wider(values_from=Trips, names_from=State) %>%
GGally::ggpairs(columns = 2:9) +
theme(axis.text.x = element_text(angle = 90))
We can add a loess line to the scatterplots to help identify non-linearities with the following code:
visitors %>%
pivot_wider(values_from=Trips, names_from=State) %>%
GGally::ggpairs(columns = 2:9, lower = list(continuous = wrap("smooth_loess", color="lightblue"))) +
theme(axis.text.x = element_text(angle = 90))
Interpretation extracted from ref. 1:
The value of the scatterplot matrix is that it enables a quick view of the relationships between all pairs of variables. In this example, mostly positive relationships are revealed, with the strongest relationships being between the neighbouring states located in the south and south east coast of Australia, namely, New South Wales, Victoria and South Australia. Some negative relationships are also revealed between the Northern Territory and other regions. The Northern Territory is located in the north of Australia famous for its outback desert landscapes visited mostly in winter. Hence, the peak visitation in the Northern Territory is in the July (winter) quarter in contrast to January (summer) quarter for the rest of the regions.